Project: Wrangle and Analyze Data¶

In [1]:
import numpy as np
import pandas as pd
import requests
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
In [2]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.options.plotting.backend = "plotly"

Data Gathering¶

In the cells below, gather all three pieces of data for this project and load them into the notebook. Note: the method required to gather each piece of data is different.

  1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)
In [3]:
archive = pd.read_csv('twitter-archive-enhanced.csv')
  2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)
In [4]:
url = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-'
       'predictions.tsv')
response = requests.get(url)

with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)

img_predictions = pd.read_csv('image-predictions.tsv', sep='\t')

Note: The read_csv function can also read TSVs straight from a URL.

img_predictions = pd.read_csv(url, sep='\t')
  3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_1.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)
In [5]:
api_data = pd.read_json('tweet_json.txt', lines=True)
api_data = api_data[['id','retweet_count','favorite_count']]

Assessing Data¶

In this section, detect and document at least eight (8) quality issues and two (2) tidiness issues. You must use both visual and programmatic assessment to assess the data.

Note: pay attention to the following key points when you assess the data.

  • You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
  • Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
  • The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
  • You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
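The first key point above amounts to a two-part filter: drop retweets, keep tweets with media links. A minimal sketch on a toy frame (the column names match the archive; the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the relevant archive columns (values are invented)
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 99.0, np.nan, np.nan],  # non-null means retweet
    'expanded_urls': ['https://t.co/a', 'https://t.co/b', np.nan, 'https://t.co/d'],
})

# Keep original tweets (no retweeted_status_id) that link to media (have expanded_urls)
originals = toy[toy['retweeted_status_id'].isna() & toy['expanded_urls'].notna()]
```

This is the same logic the Cleaning section applies to the real archive in two separate steps.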
In [6]:
archive.head()
Out[6]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU NaN NaN NaN https://twitter.com/dog_rates/status/892420643555336193/photo/1 13 10 Phineas NaN NaN NaN NaN
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV NaN NaN NaN https://twitter.com/dog_rates/status/892177421306343426/photo/1 13 10 Tilly NaN NaN NaN NaN
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB NaN NaN NaN https://twitter.com/dog_rates/status/891815181378084864/photo/1 12 10 Archie NaN NaN NaN NaN
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ NaN NaN NaN https://twitter.com/dog_rates/status/891689557279858688/photo/1 13 10 Darla NaN NaN NaN NaN
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f NaN NaN NaN https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 12 10 Franklin NaN NaN NaN NaN
In [7]:
archive['source'].value_counts()
Out[7]:
source
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: count, dtype: int64
In [8]:
archive['text'].head()
Out[8]:
0                                                         This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
1    This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV
2                     This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB
3                                                               This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ
4    This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
Name: text, dtype: object
In [9]:
archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        1611 non-null   object 
 13  doggo                       97 non-null     object 
 14  floofer                     10 non-null     object 
 15  pupper                      257 non-null    object 
 16  puppo                       30 non-null     object 
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [10]:
numerator_count = archive['rating_numerator'].value_counts().sort_index()

# Transposing the data saves screen space.
numerator_count_df_T = pd.DataFrame(numerator_count).T
numerator_count_df_T.rename({'rating_numerator':'numerator_count'}, inplace=True)
numerator_count_df_T
Out[10]:
rating_numerator 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 20 24 26 27 44 45 50 60 75 80 84 88 99 121 143 144 165 182 204 420 666 960 1776
count 2 9 9 19 17 37 32 55 102 158 461 464 558 351 54 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1
In [11]:
archive.query('rating_numerator > 14')[['text','rating_numerator','rating_denominator']].head(10)
Out[11]:
text rating_numerator rating_denominator
55 @roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s 17 10
188 @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research 420 10
189 @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10 666 10
285 RT @KibaDva: I collected all the good dogs!! 15/10 @dog_rates #GoodDogs https://t.co/6UCGFczlOI 15 10
290 @markhoppus 182/10 182 10
291 @bragg6of8 @Andy_Pace_ we are still looking for the first 15/10 15 10
313 @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho 960 0
340 RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu… 75 10
433 The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd 84 70
516 Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx 24 7

📝 The 84/70 rating is for a group of dogs. This warrants further analysis...

In [12]:
denominator_count = archive['rating_denominator'].value_counts().sort_index()

# Transposing the data saves screen space.
denominator_count_df_T = pd.DataFrame(denominator_count).T
denominator_count_df_T.rename({'rating_denominator':'denominator_count'}, inplace=True)
denominator_count_df_T
Out[12]:
rating_denominator 0 2 7 10 11 15 16 20 40 50 70 80 90 110 120 130 150 170
count 1 1 1 2333 3 1 1 2 1 3 1 2 1 1 1 1 1 1
In [13]:
img_predictions.head()
Out[13]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [14]:
img_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [15]:
api_data.head()
Out[15]:
id retweet_count favorite_count
0 892420643555336193 8853 39467
1 892177421306343426 6514 33819
2 891815181378084864 4328 25461
3 891689557279858688 8964 42908
4 891327558926688256 9774 41048
In [16]:
api_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              2354 non-null   int64
 1   retweet_count   2354 non-null   int64
 2   favorite_count  2354 non-null   int64
dtypes: int64(3)
memory usage: 55.3 KB

Quality issues¶

Archive¶

  1. Includes retweets. Per the retweeted_ columns, 181 of the entries are retweets.

  2. Includes tweets without URLs. There are 2356 entries, but only 2297 of them have expanded_urls.

  3. The source data includes HTML tags.

  4. The text data includes URLs referring back to the tweet itself. These URLs are shortened forms of those found in the expanded_urls column.

  5. Some of the expanded_urls values consist of duplicate URLs strung together, separated by commas.

  6. The rating_numerator column

    • lists only the fractional part of decimal numbers,
    • lists the first numerator found in the text of each tweet—whether it's the rating or not.
  7. The rating_denominator column likewise lists the first denominator found in the text of each tweet—whether it's the rating or not.

  8. The id columns should be object-type, not float or int, and the timestamp column should be datetime.
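Quality issue 6 is easy to reproduce in isolation: an integer-only capture group keeps just the fractional part of a decimal rating, while allowing an optional integer-and-dot prefix recovers the whole numerator. (The exact pattern used to build the archive column is an assumption here.)

```python
import re

text = "This is Logan, the Chow who lived. H*ckin magical af 9.75/10"

# An integer-only pattern captures only the fractional part of the rating...
naive = re.findall(r'(\d+)\/\d+', text)[0]

# ...while permitting digits and a dot before it recovers the full numerator
full = re.findall(r'(\d+\.?\d*)\/\d+', text)[0]
```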

Tidiness issues¶

  1. The tweet data spans multiple tables. The api_data consists of additional data on the tweets in archive, and the img_predictions are largely for the images in those tweets. As all three tables pertain to the same set of tweets, they should be merged into a single master table.

  2. The "dog stage" data spans multiple columns: doggo, floofer, pupper, and puppo are values of the "dog stage" variable.

Cleaning Data¶

In this section, clean all of the issues you documented while assessing.

Note: Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of tidy data. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [17]:
# Make copies of original pieces of data
archive_clean = archive.copy()
api_data_clean = api_data.copy()
img_predict_clean = img_predictions.copy()

Issue #1: The 'archive' table includes retweets.¶

Define: Keep only the tweets without retweeted_ data.* Then drop the now-redundant retweeted_ columns.¶

*If a tweet has `retweeted_` data, it's a retweet.

Code¶

In [18]:
archive_clean.query('retweeted_status_id.isna()', inplace=True)
archive_clean.drop(columns=['retweeted_status_id',
                            'retweeted_status_user_id',
                            'retweeted_status_timestamp',
                            ], inplace=True)

Test¶

In [19]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   tweet_id               2175 non-null   int64  
 1   in_reply_to_status_id  78 non-null     float64
 2   in_reply_to_user_id    78 non-null     float64
 3   timestamp              2175 non-null   object 
 4   source                 2175 non-null   object 
 5   text                   2175 non-null   object 
 6   expanded_urls          2117 non-null   object 
 7   rating_numerator       2175 non-null   int64  
 8   rating_denominator     2175 non-null   int64  
 9   name                   1495 non-null   object 
 10  doggo                  87 non-null     object 
 11  floofer                10 non-null     object 
 12  pupper                 234 non-null    object 
 13  puppo                  25 non-null     object 
dtypes: float64(2), int64(3), object(9)
memory usage: 254.9+ KB

Issue #2: The 'archive' table includes tweets without URLs.¶

Define: Drop the tweets without expanded_urls.¶

Code¶

In [20]:
archive_clean.dropna(subset='expanded_urls', inplace=True)

Test¶

In [21]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2117 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   tweet_id               2117 non-null   int64  
 1   in_reply_to_status_id  23 non-null     float64
 2   in_reply_to_user_id    23 non-null     float64
 3   timestamp              2117 non-null   object 
 4   source                 2117 non-null   object 
 5   text                   2117 non-null   object 
 6   expanded_urls          2117 non-null   object 
 7   rating_numerator       2117 non-null   int64  
 8   rating_denominator     2117 non-null   int64  
 9   name                   1495 non-null   object 
 10  doggo                  84 non-null     object 
 11  floofer                10 non-null     object 
 12  pupper                 232 non-null    object 
 13  puppo                  24 non-null     object 
dtypes: float64(2), int64(3), object(9)
memory usage: 248.1+ KB

Issue #3: The tweet data spans multiple tables.¶

Define: Merge the tables, renaming the key columns as needed.¶

Code¶

In [22]:
# The key column is named `tweet_id` in two tables but `id` in the API table.
api_data_clean.rename(columns={'id':'tweet_id'}, inplace=True)
 
archive_clean = pd.merge(api_data_clean, archive_clean,
                         how='inner', on='tweet_id')
archive_clean = pd.merge(archive_clean, img_predict_clean,
                         how='inner', on='tweet_id')

Test¶

In [23]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   tweet_id               1994 non-null   int64  
 1   retweet_count          1994 non-null   int64  
 2   favorite_count         1994 non-null   int64  
 3   in_reply_to_status_id  23 non-null     float64
 4   in_reply_to_user_id    23 non-null     float64
 5   timestamp              1994 non-null   object 
 6   source                 1994 non-null   object 
 7   text                   1994 non-null   object 
 8   expanded_urls          1994 non-null   object 
 9   rating_numerator       1994 non-null   int64  
 10  rating_denominator     1994 non-null   int64  
 11  name                   1448 non-null   object 
 12  doggo                  74 non-null     object 
 13  floofer                8 non-null      object 
 14  pupper                 212 non-null    object 
 15  puppo                  23 non-null     object 
 16  jpg_url                1994 non-null   object 
 17  img_num                1994 non-null   int64  
 18  p1                     1994 non-null   object 
 19  p1_conf                1994 non-null   float64
 20  p1_dog                 1994 non-null   bool   
 21  p2                     1994 non-null   object 
 22  p2_conf                1994 non-null   float64
 23  p2_dog                 1994 non-null   bool   
 24  p3                     1994 non-null   object 
 25  p3_conf                1994 non-null   float64
 26  p3_dog                 1994 non-null   bool   
dtypes: bool(3), float64(5), int64(6), object(13)
memory usage: 379.8+ KB

Issue #4: The "dog stage" data spans multiple columns: doggo, floofer, pupper, and puppo are values of the "dog stage" variable.¶

Most of the values are null.

Define: Join each row's non-null "dog stage" values into a new dog_stage column. Then drop the original stage columns, keeping only dog_stage.¶

Code¶

In [24]:
dog_stage_cols = ['doggo', 'pupper', 'floofer', 'puppo']

# read_csv parses the literal string "None" as NaN, so fill the NaNs back in before joining
archive_clean[dog_stage_cols] = archive_clean[dog_stage_cols].fillna('None')

f = lambda x: ', '.join(x[x != 'None'])

archive_clean['dog_stage'] = archive_clean[dog_stage_cols].apply(f, axis = 1)
archive_clean.drop(columns=dog_stage_cols, inplace=True)

Test¶

In [25]:
archive_clean['dog_stage'].loc[10:14]
Out[25]:
10         
11         
12    puppo
13         
14    puppo
Name: dog_stage, dtype: object
In [26]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   tweet_id               1994 non-null   int64  
 1   retweet_count          1994 non-null   int64  
 2   favorite_count         1994 non-null   int64  
 3   in_reply_to_status_id  23 non-null     float64
 4   in_reply_to_user_id    23 non-null     float64
 5   timestamp              1994 non-null   object 
 6   source                 1994 non-null   object 
 7   text                   1994 non-null   object 
 8   expanded_urls          1994 non-null   object 
 9   rating_numerator       1994 non-null   int64  
 10  rating_denominator     1994 non-null   int64  
 11  name                   1448 non-null   object 
 12  jpg_url                1994 non-null   object 
 13  img_num                1994 non-null   int64  
 14  p1                     1994 non-null   object 
 15  p1_conf                1994 non-null   float64
 16  p1_dog                 1994 non-null   bool   
 17  p2                     1994 non-null   object 
 18  p2_conf                1994 non-null   float64
 19  p2_dog                 1994 non-null   bool   
 20  p3                     1994 non-null   object 
 21  p3_conf                1994 non-null   float64
 22  p3_dog                 1994 non-null   bool   
 23  dog_stage              1994 non-null   object 
dtypes: bool(3), float64(5), int64(6), object(10)
memory usage: 333.1+ KB
In [27]:
archive_clean['dog_stage'].unique()
Out[27]:
array(['', 'doggo', 'puppo', 'pupper', 'floofer', 'doggo, puppo',
       'doggo, floofer', 'doggo, pupper'], dtype=object)

Issue #4.1: Empty strings are interpreted as non-null.¶

Define: Replace the empty strings with None.¶

Code¶

In [28]:
archive_clean['dog_stage'] = archive_clean['dog_stage'].replace('', None)

Test¶

In [29]:
archive_clean['dog_stage'].loc[10:14]
Out[29]:
10     None
11     None
12    puppo
13     None
14    puppo
Name: dog_stage, dtype: object
In [30]:
archive_clean['dog_stage'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 1994 entries, 0 to 1993
Series name: dog_stage
Non-Null Count  Dtype 
--------------  ----- 
306 non-null    object
dtypes: object(1)
memory usage: 15.7+ KB

Issue #5: The source data includes HTML tags.¶

Define: Extract the data from between the tags.¶

Code¶

In [31]:
archive_clean['source'] = archive_clean['source'].str.extract(r'(>.+<)')
archive_clean['source'] = archive_clean['source'].str.strip('><')

Test¶

In [32]:
archive_clean['source'].unique()
Out[32]:
array(['Twitter for iPhone', 'Twitter Web Client', 'TweetDeck'],
      dtype=object)

Issue #6: The text data includes URLs referring back to the tweet itself.¶

Define: Remove the tweet URLs.¶

Code¶

In [33]:
regex = r' https://t\.co/[a-zA-Z0-9]+'

archive_clean['text'] = archive_clean['text'].str.replace(regex, '', regex=True)

Test¶

In [34]:
archive_clean['text'].str.contains('https://t.co').sum()
Out[34]:
26
In [35]:
archive_clean.query('text.str.contains("https://t.co")')['text'].head()
Out[35]:
6      Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H
26                        This is Mingus. He's a wonderful father to his smol pup. Confirmed 13/10, but he needs your help\n\nhttps://t.co/bVi0Yr4Cff
63     Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps://t.co/245xJJElsY
80     This is Sierra. She's one precious pupper. Absolute 12/10. Been in and out of ICU her whole life. Help Sierra below\n\nhttps://t.co/Xp01EU3qyD
102     Say hello to Cooper. His expression is the same wet or dry. Absolute 12/10 but Coop desperately requests your help\n\nhttps://t.co/ZMTE4Mr69f
Name: text, dtype: object

✅ These URLs—separated by newlines—point to fundraiser pages: they are part of the tweet.
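The reason those fundraiser links survive is the leading space in the removal pattern: media links are appended to the text after a single space, while the fundraiser links follow two newlines. A standalone sketch (same idea as the pattern above, with the dot in t.co escaped; the example texts are invented):

```python
import re

# Leading space means only space-prefixed links match; note the escaped dot in t\.co
pattern = r' https://t\.co/[a-zA-Z0-9]+'

media_text = "Good dog. 13/10 https://t.co/abc123"       # media link, preceded by a space
fund_text = "Help him out below\n\nhttps://t.co/xyz789"  # fundraiser link, preceded by newlines

cleaned_media = re.sub(pattern, '', media_text)  # media link removed
cleaned_fund = re.sub(pattern, '', fund_text)    # fundraiser link kept
```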

Issue #7: Some of the expanded_urls values consist of duplicate URLs strung together, separated by commas.¶

Define: Split the multi-URL strings into URL lists; convert them to sets and then back again (sets don't allow duplicates).¶

Code¶

In [36]:
archive_clean['expanded_urls'] = archive_clean['expanded_urls'].str.split(',')
In [37]:
f = lambda x: [*set(x)]

archive_clean['expanded_urls'] = archive_clean['expanded_urls'].map(f)

Test¶

In [38]:
multi_url = [urls for urls in archive_clean['expanded_urls'] if len(urls) > 1]

multi_url[:5]
Out[38]:
[['https://twitter.com/dog_rates/status/890971913173991426/photo/1',
  'https://gofundme.com/ydvmve-surgery-for-jax'],
 ['https://twitter.com/dog_rates/status/886736880519319552/photo/1',
  'https://www.gofundme.com/mingusneedsus'],
 ['https://twitter.com/dog_rates/status/878281511006478336/photo/1',
  'https://www.gofundme.com/3yd6y1c'],
 ['https://www.gofundme.com/help-my-baby-sierra-get-better',
  'https://twitter.com/dog_rates/status/873213775632977920/photo/1'],
 ['https://twitter.com/dog_rates/status/870656317836468226/photo/1',
  'https://www.gofundme.com/help-fix-codys-torn-acl']]

⚠️ While not duplicates per se, these fundraiser URLs are somewhat redundant with the shortened t.co versions kept in the text data (see the test for Issue #6 above).

Issue #8: The rating_numerator column¶

  • lists only the fractional part of decimal numbers,
  • lists the first numerator found in the text of each tweet—whether it's the rating or not.

Define: Replace the rating_numerator values with complete rating numerators—integers and all—extracted from the text data.¶

Code¶

In [39]:
# Find all sequences resembling a numerator and select the last one. The ratings are at
# the end.
archive_clean['rating_numerator'] = archive_clean['text'].str.findall(r'(\d*\.?\d+\/)').str[-1]
archive_clean['rating_numerator'] = archive_clean['rating_numerator'].str.rstrip('/')

Test¶

In [40]:
archive_clean.query(r'text.str.contains(r"\d+\.\d+\/")')[['text','rating_numerator']]
Out[40]:
text rating_numerator
39 This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 13.5
503 This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 9.75
553 This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back 11.27
1374 Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 11.26
In [41]:
archive_clean[['text','rating_numerator']]
Out[41]:
text rating_numerator
0 This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 13
1 This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 13
2 This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 12
3 This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us 13
4 This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek 12
... ... ...
1989 Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 5
1990 This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 6
1991 Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af 9
1992 This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of 7
1993 Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet 8

1994 rows × 2 columns

In [42]:
archive_clean['rating_numerator'].unique()
Out[42]:
array(['13', '12', '14', '13.5', '11', '6', '10', '0', '84', '24', '9.75',
       '5', '11.27', '3', '.10', '7', '8', '9', '4', '165', '1776', '.11',
       '.9', '204', '99', '80', '.12', '45', '60', '2', '44', '.8', '143',
       '121', '11.26', '144', '88', '1', '420'], dtype=object)
In [43]:
archive_clean.query('rating_numerator == ".10"')[['text','rating_numerator']].head()
Out[43]:
text rating_numerator
586 This is Finn. He's very nervous for the game. Has a lot of money riding on it.10/10 would attempt to comfort .10
734 What jokester sent in a pic without a dog in it? This is not @rock_rates. This is @dog_rates. Thank you ...10/10 .10
751 Guys pls stop sending actual sharks. It's too dangerous for me and the people taking the photos. Thank you ...10/10 .10
803 This is getting incredibly frustrating. This is a Mexican Golden Beaver. We only rate dogs. Only send dogs ...10/10 .10
808 This... is a Tyrannosaurus rex. We only rate dogs. Please only send in dogs. Thank you ...10/10 .10

Issue #8.1: Periods immediately preceding the rating numerators were picked up by the RegEx.¶

Define: Strip preceding periods from the rating_numerator values.¶

Code¶

In [44]:
archive_clean['rating_numerator'] = archive_clean['rating_numerator'].str.lstrip('.')

Test¶

In [45]:
archive_clean['rating_numerator'].unique()
Out[45]:
array(['13', '12', '14', '13.5', '11', '6', '10', '0', '84', '24', '9.75',
       '5', '11.27', '3', '7', '8', '9', '4', '165', '1776', '204', '99',
       '80', '45', '60', '2', '44', '143', '121', '11.26', '144', '88',
       '1', '420'], dtype=object)
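One side effect of the string extraction: rating_numerator now holds strings rather than numbers. If numeric analysis is planned, a follow-up cast with pd.to_numeric (not part of the original notebook) would be natural:

```python
import pandas as pd

# Values as they appear after extraction: strings, some with decimal parts
numerators = pd.Series(['13', '9.75', '11.27', '420'])

# pd.to_numeric settles on a float dtype because of the decimal values
numerators = pd.to_numeric(numerators)
```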

Issue #9: The rating_denominator column lists the first denominator found in the text of each tweet—whether it's the rating or not.¶

As with the numerators, the actual rating denominator sits at the end of the text.

Define: Extract the last denominator found in the text of each tweet.¶

Code¶

In [46]:
# The trailing `[\s.,!]+` verifies that the match isn't part of a URL. And the `\/\d+$`
# matches ratings at the very end of the text.
archive_clean['rating_denominator'] = archive_clean['text'].str.findall(r'(\/\d+[\s.,!]+|\/\d+$)').str[-1]
archive_clean['rating_denominator'] = archive_clean['rating_denominator'].str.strip('/.')

Test¶

In [47]:
archive_clean.query('rating_denominator.isna()')[['text','rating_numerator']].head()
Out[47]:
text rating_numerator
723 Meet Jax &amp; Jil. Jil is yelling the pledge of allegiance. If u cant take the freedom get out the kitchen Jax. 10/10s 10
855 "Challenge completed" \n(pupgraded to 12/10) 12
1304 Gang of fearless hoofed puppers here. Straight savages. Elevated for extra terror. Front one has killed before 6/10s 6
1332 Meet Joey and Izzy. Joey only has one ear that works and Izzy wants 2015 to be over already. Both great pups. 11/10s 11
1496 🎶 HELLO FROM THE OTHER SIIIIIIIIDE 🎶 10/10s 10
In [48]:
archive_clean['rating_denominator'].unique()
Out[48]:
array(['10', '10 ', '10, ', '10. ', '10\n\n', '10  ', '10 \n', '70', '7 ',
       '150', nan, '170 ', '10\n', '90 ', '80', '10! ', '50 ', '40 ',
       '10 \n\n', '130. ', '110', '120 ', '80 '], dtype=object)
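The last-match trick is what makes this work: `findall` collects every `/digits` pattern in the text, and `.str[-1]` keeps only the final one. A standalone sketch on two made-up tweet texts (the first contains a decoy "24/7"):

```python
import pandas as pd

# Made-up tweet texts: the first contains a decoy "24/7" before the real rating
texts = pd.Series([
    'She smiles 24/7 and secretly aspires to be a reindeer. 11/10',
    'After so many requests... here you go. 420/10',
])

# findall returns every match; .str[-1] keeps only the last (true) denominator
denoms = texts.str.findall(r'(\/\d+[\s.,!]+|\/\d+$)').str[-1].str.strip('/.')
print(denoms.tolist())  # ['10', '10']
```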

Issue #9.1: The RegEx pattern¶

  • isn't picking up denominators that are immediately followed by 's' or ')' characters,
  • is capturing excess trailing characters (whitespace, punctuation, newlines) along with some of the denominators.

Define: Add 's' and ')' to the RegEx, and strip the excess characters.¶

Code¶

In [49]:
archive_clean['rating_denominator'] = archive_clean['text'].str.findall(r'(\/\d+[\s.,!)s]+|\/\d+$)').str[-1]
archive_clean['rating_denominator'] = archive_clean['rating_denominator'].str.strip('/.,!)s \n')

Test¶

In [50]:
archive_clean['rating_denominator'].unique()
Out[50]:
array(['10', '70', '7', '150', '170', '90', '80', '50', '40', '130',
       '110', '120'], dtype=object)
In [51]:
archive_clean.query('rating_denominator != "10"')[['text','rating_denominator']].head()
Out[51]:
text rating_denominator
323 The floofs have been released I repeat the floofs have been released. 84/70 70
385 Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t 7
662 Why does this never happen at my front door... 165/150 150
848 Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once 170
946 Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody 90
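The widened pattern can be verified on texts of the shapes that previously slipped through (made-up examples modeled on the tweets above):

```python
import pandas as pd

# Made-up texts where the denominator is trailed by 's' or ')'
texts = pd.Series([
    'Both great pups. 11/10s',
    '"Challenge completed" (pupgraded to 12/10)',
])

pattern = r'(\/\d+[\s.,!)s]+|\/\d+$)'
denoms = (texts.str.findall(pattern).str[-1]
               .str.strip('/.,!)s \n'))
print(denoms.tolist())  # ['10', '10']
```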

Issue #10: The id columns should be object-type, not float or int, and the timestamp column should be datetime.¶

Define: Cast the id columns as object-type and the timestamp column as datetime.¶

Code¶

In [52]:
id_col_type = {'tweet_id': object,
               'in_reply_to_status_id': object,
               'in_reply_to_user_id': object,
               }

archive_clean = archive_clean.astype(id_col_type)
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])

Test¶

In [53]:
archive_clean.dtypes
Out[53]:
tweet_id                              object
retweet_count                          int64
favorite_count                         int64
in_reply_to_status_id                 object
in_reply_to_user_id                   object
timestamp                datetime64[ns, UTC]
source                                object
text                                  object
expanded_urls                         object
rating_numerator                      object
rating_denominator                    object
name                                  object
jpg_url                               object
img_num                                int64
p1                                    object
p1_conf                              float64
p1_dog                                  bool
p2                                    object
p2_conf                              float64
p2_dog                                  bool
p3                                    object
p3_conf                              float64
p3_dog                                  bool
dog_stage                             object
dtype: object
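One reason to keep the id columns as object rather than float: a float64 holds only 53 bits of integer precision, so a 19-digit tweet ID cast to float is silently corrupted. A standalone demonstration:

```python
import numpy as np

# A representative 19-digit tweet ID exceeds float64's 53 bits of precision
tweet_id = 892420643555336193
as_float = np.float64(tweet_id)

print(int(as_float) == tweet_id)  # False: the low digits were lost
```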

Issue #10.1: While cleaning issues #8 and #9, the rating_ columns were overwritten with number strings.¶

Define: Cast the rating_ columns as float-type.¶

Code¶

In [54]:
col_type_plus = {'rating_numerator': float,
                 'rating_denominator': float,
                 }

archive_clean = archive_clean.astype(col_type_plus)

Test¶

In [55]:
archive_clean[['rating_numerator', 'rating_denominator']].dtypes
Out[55]:
rating_numerator      float64
rating_denominator    float64
dtype: object

Storing Data¶

Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [56]:
archive_clean.to_csv('twitter_archive_master.csv', index=False)
In [57]:
df = pd.read_csv('twitter_archive_master.csv', dtype=id_col_type, parse_dates=['timestamp'])
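The `dtype` and `parse_dates` arguments matter on the re-read; without them, pandas would re-infer `tweet_id` as an integer and leave `timestamp` a plain string. A minimal round-trip sketch with a hypothetical one-row CSV:

```python
import io
import pandas as pd

# Hypothetical one-row CSV standing in for twitter_archive_master.csv
csv_text = 'tweet_id,rating_numerator\n892420643555336193,13.0\n'

# Without a dtype hint, pandas re-infers tweet_id as int64 on re-read
naive = pd.read_csv(io.StringIO(csv_text))
typed = pd.read_csv(io.StringIO(csv_text), dtype={'tweet_id': object})

print(naive['tweet_id'].dtype)  # int64
print(typed['tweet_id'].dtype)  # object
```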

Analyzing and Visualizing Data¶

In this section, analyze and visualize your wrangled data. You must produce at least three (3) insights and one (1) visualization.

In [58]:
rating_cols = ['rating_numerator','rating_denominator','text','jpg_url']

df[rating_cols].query('rating_denominator > 10')
Out[58]:
rating_numerator rating_denominator text jpg_url
323 84.0 70.0 The floofs have been released I repeat the floofs have been released. 84/70 https://pbs.twimg.com/media/C2OtWr0VQAEnS9r.jpg
662 165.0 150.0 Why does this never happen at my front door... 165/150 https://pbs.twimg.com/ext_tw_video_thumb/758467147756691456/pu/img/YTNzjRFDSPNXukmM.jpg
848 204.0 170.0 Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://pbs.twimg.com/media/CiWWhVNUYAAab_r.jpg
946 99.0 90.0 Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://pbs.twimg.com/media/CehIzzZWQAEyHH5.jpg
970 80.0 80.0 Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://pbs.twimg.com/media/CdzETn4W4AAVU5N.jpg
988 45.0 50.0 From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://pbs.twimg.com/media/CdeUKpcWoAAJAWJ.jpg
1054 60.0 50.0 Here is a whole flock of puppers. 60/50 I'll take the lot https://pbs.twimg.com/media/CcVOJEcXEAM0FHL.jpg
1130 44.0 40.0 Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://pbs.twimg.com/media/Ca3i7CzXIAMLhg8.jpg
1302 143.0 130.0 Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://pbs.twimg.com/media/CX7br3HWsAAQ9L1.jpg
1303 121.0 110.0 Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://pbs.twimg.com/media/CX7Y_ByWwAEJdUy.jpg
1435 144.0 120.0 IT'S PUPPERGEDDON. Total of 144/120 ...I think https://pbs.twimg.com/media/CWe7kw9W4AE8UJh.jpg
1494 88.0 80.0 Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://pbs.twimg.com/media/CWEcxqWVEAAHyGH.jpg

Insight #1: The rating_denominator is $10 \times n$, where $n$ is the number of dogs.¶

  • There's normally one dog, and the normal denominator is 10.
  • 988 has five dogs and a denominator of 50.
  • 946 has nine puppers and a rating of 99/90, an average of 11/10 per pupper.
In [59]:
# Normalization of ratings
df['normalized_rating'] = df['rating_numerator'] / df['rating_denominator'] * 10

df['normalized_rating'].describe()
Out[59]:
count    1994.000000
mean       11.646472
std        40.643919
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: normalized_rating, dtype: float64
In [60]:
q1, q3 = df['normalized_rating'].quantile([0.25, 0.75])
iqr = q3 - q1

lower_limit = q1 - 1.5*iqr
upper_limit = q3 + 1.5*iqr
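With the quartiles reported by `describe()` above (Q1 = 10, Q3 = 12), the Tukey fences evaluate to 7 and 15, matching the min and max seen after filtering. A quick standalone check:

```python
# Quartiles reported by describe() above
q1, q3 = 10.0, 12.0
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
print(lower_limit, upper_limit)  # 7.0 15.0
```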
In [61]:
df_sans_outlier = df.query('normalized_rating >= @lower_limit & normalized_rating <= @upper_limit')

df_sans_outlier['normalized_rating'].describe()
Out[61]:
count    1875.000000
mean       10.930016
std         1.562598
min         7.000000
25%        10.000000
50%        11.000000
75%        12.000000
max        14.000000
Name: normalized_rating, dtype: float64
In [62]:
# Pre-normalization
df['rating_numerator'].mean()
Out[62]:
12.215035105315946
In [63]:
df_outliers = df.query('normalized_rating < @lower_limit | normalized_rating > @upper_limit')

df_outliers[rating_cols].query('rating_numerator > @upper_limit')
Out[63]:
rating_numerator rating_denominator text jpg_url
385 24.0 7.0 Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://pbs.twimg.com/media/C0EyPZbXAAAceSc.jpg
726 1776.0 10.0 This is Atticus. He's quite simply America af. 1776/10 https://pbs.twimg.com/media/CmgBZ7kWcAAlzFD.jpg
1718 420.0 10.0 After so many requests... here you go.\n\nGood dogg. 420/10 https://pbs.twimg.com/media/CU9P717W4AAOlKx.jpg

Insight #1.1: Normalized and stripped of outliers, the mean rating is 10.9 out of 10.¶

  • With outliers, the mean is 11.6.
    • Largely due to a couple of gag tweets (above).
  • Prior to normalization, the mean rating was 12.2.
In [64]:
conf_cols = ['p1_conf','p2_conf','p3_conf']

df_sans_outlier[conf_cols].describe()
Out[64]:
p1_conf p2_conf p3_conf
count 1875.000000 1.875000e+03 1.875000e+03
mean 0.598726 1.339234e-01 5.989640e-02
std 0.271115 1.004777e-01 5.096227e-02
min 0.044333 1.011300e-08 1.740170e-10
25% 0.367655 5.352315e-02 1.600860e-02
50% 0.599076 1.175080e-01 4.899880e-02
75% 0.848720 1.939745e-01 9.153815e-02
max 1.000000 4.880140e-01 2.734190e-01

Insight #2: The neural network is not very confident in its dog breed predictions.¶

  • The median confidence level is 0.60, and that's for its best guesses!
  • The five-number summary is:

    | statistic | p1_conf |
    | --- | --- |
    | minimum | 4% |
    | 1st quartile | 37% |
    | median | 60% |
    | 3rd quartile | 85% |
    | maximum | 100% |
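The same five-number summary can be pulled directly with `quantile` (a standalone sketch using the five rounded values above):

```python
import pandas as pd

# The five rounded summary values from above
conf = pd.Series([0.04, 0.37, 0.60, 0.85, 1.00])

# quantile at [0, .25, .5, .75, 1] reproduces the five-number summary
summary = conf.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(summary.tolist())  # [0.04, 0.37, 0.6, 0.85, 1.0]
```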
In [65]:
# If we are to analyze the tweets on a per-breed basis, we need to be reasonably sure that
# a Pug is, in fact, a Pug.
by_breed = df_sans_outlier.query('p1_conf >= 0.80 & p1_dog == True').groupby('p1')
In [66]:
num_dogs = by_breed['normalized_rating'].count().rename('number_of_dogs')
num_dogs
Out[66]:
p1
Afghan_hound                       1
Airedale                           3
American_Staffordshire_terrier     5
Bernese_mountain_dog               4
Blenheim_spaniel                   6
                                  ..
schipperke                         3
soft-coated_wheaten_terrier        2
toy_poodle                        11
vizsla                             5
whippet                            1
Name: number_of_dogs, Length: 74, dtype: int64
In [67]:
mean_rating = by_breed['normalized_rating'].mean().rename('mean_rating')
mean_rating
Out[67]:
p1
Afghan_hound                      13.000000
Airedale                          11.000000
American_Staffordshire_terrier    11.200000
Bernese_mountain_dog              11.250000
Blenheim_spaniel                  11.166667
                                    ...    
schipperke                        11.000000
soft-coated_wheaten_terrier       11.500000
toy_poodle                        11.363636
vizsla                            10.800000
whippet                            9.000000
Name: mean_rating, Length: 74, dtype: float64
In [68]:
df_by_breed = pd.DataFrame([num_dogs, mean_rating]).T
df_by_breed.sort_values('mean_rating')
Out[68]:
number_of_dogs mean_rating
p1
whippet 1.0 9.0
English_setter 1.0 9.0
Italian_greyhound 3.0 9.0
basenji 2.0 9.5
keeshond 3.0 10.0
... ... ...
briard 1.0 13.0
Leonberg 1.0 13.0
Afghan_hound 1.0 13.0
black-and-tan_coonhound 1.0 14.0
Gordon_setter 1.0 14.0

74 rows × 2 columns

In [69]:
# In this case, slicing the `mean_rating` column outputs the data more legibly: monospaced values
# make up-and-down comparisons easier.
df_by_breed.groupby('number_of_dogs')['mean_rating'].mean().sort_index()
Out[69]:
number_of_dogs
1.0     11.403889
2.0     10.708333
3.0     10.857143
4.0     10.714286
5.0     10.828571
6.0     10.666667
7.0     11.285714
8.0     11.125000
9.0     10.888889
11.0    11.363636
14.0    11.357143
16.0    11.875000
18.0    10.500000
21.0    11.190476
22.0    11.909091
30.0    10.600000
33.0    11.515152
38.0    11.973684
63.0    11.880952
Name: mean_rating, dtype: float64

Insight #3: Being unique appears to give a ratings boost.¶

  • The highest-rated breeds each appear only once in the sample.
  • On average, the most popular breeds tend to be rated higher than the less common breeds:
    • Unique breeds (a single confidently identified dog) are the exception.
In [70]:
# Note: newer pandas versions no longer select numeric columns automatically in
# groupby aggregations, so select them explicitly.
by_breed_mean = by_breed[['normalized_rating', 'favorite_count']].mean()
In [71]:
fig = by_breed_mean.plot.scatter(x='normalized_rating',
                                 y='favorite_count',
                                 labels={'normalized_rating':'Normalized Rating',
                                         'favorite_count':'Favorite Count',
                                         },
                                 title='Breed Rating vs Popular Preference',
                                 hover_name=by_breed_mean.index,
                                 trendline='ols')

fig.update_layout(title={'x':0.5,
                         'xanchor':'center',
                         'y':0.85,
                         })

fig.show()
In [72]:
# Finding the correlation coefficient (r) using the R-squared sourced from
# the OLS trendline tooltip (above).
np.sqrt(0.357845)
Out[72]:
0.5982014710780976
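Note that the square root of R² only recovers the magnitude of r; the sign has to come from the trendline's slope, which is positive here. A standalone sketch on synthetic data illustrating the relationship:

```python
import numpy as np

# Synthetic, positively correlated data (seeded for reproducibility)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)

r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

# sqrt(R^2) recovers |r|; the positive slope tells us r itself is positive
print(np.isclose(np.sqrt(r_squared), abs(r)))  # True
print(r > 0)  # True
```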

Insight #4: The ratings correlate quite well with popular preference.¶

  • There is a moderate, positive correlation between normalized rating and favorite count.
    • The correlation coefficient ($r$) is 0.5982.
  • Favorite count is a reflection of user preference. The most-favorited dog breeds are the most preferred.

Insights Recap:¶

  1. The rating_denominator is $10 \times n$, where $n$ is the number of dogs.

    1.1. Normalized and stripped of outliers, the mean rating is 10.9 out of 10.

  2. The neural network is not very confident in its dog breed predictions.

  3. Being a unique breed appears to give a ratings boost.

  4. The ratings correlate quite well with popular preference.